Robert Kirk

Evaluating whether AI models would sabotage AI safety research

Apr 27, 2026 (via arXiv)

Propensity Inference: Environmental Contributors to LLM Behaviour

Apr 22, 2026 (via arXiv)

UK AISI Alignment Evaluation Case-Study

Apr 01, 2026 (via arXiv)

Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

Feb 11, 2026 (via arXiv)

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Oct 08, 2025 (via arXiv)

Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Jul 28, 2025 (via arXiv)

Reward Model Overoptimisation in Iterated RLHF

May 23, 2025 (via arXiv)

An Example Safety Case for Safeguards Against Misuse

May 23, 2025 (via arXiv)

Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

Feb 24, 2025 (via arXiv)

How Do Large Language Monkeys Get Their Power (Laws)?

Feb 24, 2025 (via arXiv)